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Abstract —We consider the problem of identifying patterns in 
a data set that exhibit anomalous behavior, often referred to as 
anomaly detection. Similarity-based anomaly detection algorithms 
detect abnormally large amounts of similarity or dissimilarity, 
e.g. as measured by nearest neighbor Euclidean distances between 
a test sample and the training samples. In many application 
domains there may not exist a single dissimilarity measure 
that captures all possible anomalous patterns. In such cases, 
multiple dissimilarity measures can be defined, including non¬ 
metric measures, and one can test for anomalies by scalarizing 
using a non-negative linear combination of them. If the relative 
importance of the different dissimilarity measures are not known 
in advance, as in many anomaly detection applications, the 
anomaly detection algorithm may need to be executed multiple 
times with different choices of weights in the linear combination. 
In this paper, we propose a method for similarity-based anomaly 
detection using a novel multi-criteria dissimilarity measure, the 
Pareto depth. The proposed Pareto depth analysis (PDA) anomaly 
detection algorithm uses the concept of Pareto optimality to 
detect anomalies under multiple criteria without having to run an 
algorithm multiple times with different choices of weights. The 
proposed PDA approach is provably better than using linear 
combinations of the criteria and shows superior performance on 
experiments with synthetic and real data sets. 

Index Terms —multi-criteria dissimilarity measure, similarity- 
based learning, combining dissimilarities, Pareto front, scalariza- 
tion gap, partial correlation 


I. Introduction 

Identifying patterns of anomalous behavior in a data set, 
often referred to as anomaly detection, is an important prob¬ 
lem with diverse applications including intrusion detection in 
computer networks, detection of credit card fraud, and medical 
informatics @, 0. Similarity-based approaches to anomaly 
detection have generated much interest 0-© due to their 
relative simplicity and robustness as compared to model-based, 
cluster-based, and density-based approaches These 

approaches typically involve the calculation of similarities or 
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dissimilarities between data samples using a single dissimi¬ 
larity criterion, such as Euclidean distance. Examples include 
approaches based on A;-nearest neighbor (fc-NN) distances 0- 
|6|, local neighborhood densities |7|. local p-value estimates 
|8J, and geometric entropy minimization 0, (TO). 

In many application domains, such as those involving 
categorical data, it may not be possible or practical to rep¬ 
resent data samples in a geometric space in order to com¬ 
pute Euclidean distances. Furthermore, multiple dissimilarity 
measures corresponding to different criteria may be required 
to detect certain types of anomalies. For example, consider 
the problem of detecting anomalous object trajectories in 
video sequences of different lengths. Multiple criteria, such 
as dissimilarities in object speeds or trajectory shapes, can be 
used to detect a greater range of anomalies than any single 
criterion. 

In order to perform anomaly detection using these multiple 
criteria, one could first combine the dissimilarities for each 
criterion using a non-negative linear combination then apply 
a (single-criterion) anomaly detection algorithm. However, in 
many applications, the importance of the different criteria are 
not known in advance. It is thus difficult to determine how 
much weight to assign to each criterion, so one may have 
to run the anomaly detection algorithm multiple times using 
different weights selected by a grid search or similar method. 

We propose a novel multi-criteria approach for similarity- 
based anomaly detection using Pareto depth analysis (PDA). 
PDA uses the concept of Pareto optimality, which is the typical 
method for defining optimality when there may be multiple 
conflicting criteria for comparing items. An item is said to be 
Pareto-optimal if there does not exist another item that is better 
or equal in all of the criteria. An item that is Pareto-optimal 
is optimal in the usual sense under some (not necessarily 
linear) combination of the criteria. Hence PDA is able to detect 
anomalies under multiple combinations of the criteria without 
explicitly forming these combinations. 

The PDA approach involves creating dyads corresponding 
to dissimilarities between pairs of data samples under all of the 
criteria. Sets of Pareto-optimal dyads, called Pareto fronts, are 
then computed. The first Pareto front (depth one) is the set of 
non-dominated dyads. The second Pareto front (depth two) is 
obtained by removing these non-dominated dyads, i.e. peeling 
off the first front, and recomputing the first Pareto front of 
those remaining. This process continues until no dyads remain. 
In this way, each dyad is assigned to a Pareto front at some 
depth (see Fig. |T] for illustration). 

The Pareto depth of a dyad is a novel measure of dis¬ 
similarity between a pair of data samples under multiple 
criteria. In an unsupervised anomaly detection setting, the 
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Fig. 1. (a) Illustrative example with 40 training samples (blue x’s) and 2 test samples (red circle and triangle) in R 2 . (b) Dyads for the training samples (black 
dots) along with first 20 Pareto fronts (green lines) under two criteria: |Aa:| and Ay |. The Pareto fronts induce a partial ordering on the set of dyads. Dyads 
associated with the test sample marked by the red circle concentrate around shallow fronts (near the lower left of the figure), (c) Dyads associated with the 
test sample marked by the red triangle concentrate around deep fronts. 


majority of the training samples are assumed to be nominal. 
Thus a nominal test sample would likely be similar to many 
training samples under some criteria, so most dyads for the 
nominal test sample would appear in shallow Pareto fronts. 
On the other hand, an anomalous test sample would likely be 
dissimilar to many training samples under many criteria, so 
most dyads for the anomalous test sample would be located 
in deep Pareto fronts. Thus computing the Pareto depths of the 
dyads corresponding to a test sample can discriminate between 
nominal and anomalous samples. 


Under the assumption that the multi-criteria dyads can be 
modeled as realizations from a A'-dimensional density, we 
provide a mathematical analysis of properties of the first 
Pareto front relevant to anomaly detection. In particular, in the 
|Scalarization Gap Theorem] we prove upper and lower bounds 
on the degree to which the Pareto fronts are non-convex. 
For any algorithm using non-negative linear combinations of 
criteria, non-convexities in the Pareto fronts contribute to an ar¬ 
tificially inflated anomaly score, resulting in an increased false 
positive rate. Thus our analysis shows in a precise sense that 
PDA can outperform any algorithm that uses a non-negative 
linear combination of the criteria. Furthermore, this theoretical 
prediction is experimentally validated by comparing PDA 
to several single-criterion similarity-based anomaly detection 
algorithms in two experiments involving both synthetic and 
real data sets. 


The rest of this paper is organized as follows. We discuss 
related work in Section [II] In Section III we provide an intro¬ 
duction to Pareto fronts and present a theoretical analysis of 
the properties of the first Pareto front. Section[IV]relates Pareto 
fronts to the multi-criteria anomaly detection problem, which 
leads to the PDA anomaly detection algorithm. Finally we 
present three experiments in Section [V]to provide experimental 
support for our theoretical results and evaluate the performance 
of PDA for anomaly detection. 


II. Related work 

A. Multi-criteria methods for machine learning 

Several machine learning methods utilizing Pareto optimal¬ 
ity have previously been proposed; an overview can be found 
in ED . These methods typically formulate supervised machine 
learning problems as multi-objective optimization problems 
over a potentially infinite set of candidate items where finding 
even the first Pareto front is quite difficult, often requiring 
multi-objective evolutionary algorithms. These methods differ 
from our use of Pareto optimality because we consider Pareto 
fronts created from a finite set of items, so we do not need to 
employ sophisticated algorithms in order to find these fronts. 
Rather, we utilize Pareto fronts to form a statistical criterion 
for anomaly detection. 

Finding the Pareto front of a finite set of items has also been 
referred to in the literature as the skyline query ED- ED or 
the maximal vector problem ED- Research on skyline queries 
has focused on how to efficiently compute or approximate 
items on the first Pareto front and efficiently store the results 
in memory. Algorithms for skyline queries can be used in 
the proposed PDA approach for computing Pareto fronts. Our 
work differs from skyline queries because the focus of PDA 
is the utilization of multiple Pareto fronts for the purpose of 
multi-criteria anomaly detection, not the efficient computation 
or approximation of the first Pareto front. 

Hero and Fleury fl5) introduced a method for gene ranking 
using multiple Pareto fronts that is related to our approach. 
The method ranks genes, in order of interest to a biologist, 
by creating Pareto fronts on the data samples, i.e. the genes. 
In this paper, we consider Pareto fronts of dyads , which 
correspond to dissimilarities between pairs of data samples 
under multiple criteria rather than the samples themselves, and 
use the distribution of dyads in Pareto fronts to perform multi¬ 
criteria anomaly detection rather than gene ranking. 

Another related area is multi-view learning jl6), (T7), which 
involves learning from data represented by multiple sets of 
features, commonly referred to as “views”. In such a case, 
training in one view is assumed to help to improve learning in 













3 


another view. The problem of view disagreement, where sam¬ 
ples take on different classes in different views, has recently 
been investigated 1181. The views are similar to criteria in our 
problem setting. However, in our setting, different criteria may 
be orthogonal and could even give contradictory information; 
hence there may be severe view disagreement. Thus training in 
one view could actually worsen performance in another view, 
so the problem we consider differs from multi-view learning. 
A similar area is that of multiple kernel learning JT9J, which is 
typically applied to supervised learning problems, unlike the 
unsupervised anomaly detection setting we consider. 


B. Anomaly detection 

Many methods for anomaly detection have previously been 
proposed. Hodge and Austin (2| and Chandola et al. |3j 
both provide extensive surveys of different anomaly detection 
methods and applications. 

This paper focuses on the similarity-based approach to 
anomaly detection, also known as instance-based learning. 
This approach typically involves transforming similarities be¬ 
tween a test sample and training samples into an anomaly 
score. Byers and Raftery 0 proposed to use the distance 
between a sample and its fcth-nearest neighbor as the anomaly 
score for the sample; similarly, Angiulli and Pizzuti |5| and 
Eskin et al. [6j proposed to the use the sum of the distances 
between a sample and its fc nearest neighbors. Breunig et al. 0 
used an anomaly score based on the local density of the k 
nearest neighbors of a sample. Hero |9] and Sricharan and 
Hero JTO) introduced non-parametric adaptive anomaly detec¬ 
tion methods using geometric entropy minimization, based on 
random fc-point minimal spanning trees and bipartite fc-nearest 
neighbor (fc-NN) graphs, respectively. Zhao and Saligrama 
0 proposed an anomaly detection algorithm k-LPE using 
local p-value estimation (LPE) based on a fc-NN graph. The 
aforementioned anomaly detection methods only depend on 
the data through the pairs of data points (dyads) that define 
the edges in the fc-NN graphs. These methods are designed 
for a single criterion, unlike the PDA anomaly detection 
algorithm that we propose in this paper, which accommodates 
dissimilarities corresponding to multiple criteria. 

Other related approaches for anomaly detection include 1- 
class support vector machines (SVMs) [20) , where an SVM 
classifier is trained given only samples from a single class, 
and tree-based methods, where the anomaly score of a data 
sample is determined by its depth in a tree or ensemble of 
trees. Isolation forest HD and SCiForest [22 1 are two tree- 
based approaches, targeted at detecting isolated and clustered 
anomalies, respectively, using depths of samples in an ensem¬ 
ble of trees. Such tree-based approaches utilize depths to form 
anomaly scores, similar to PDA; however, they operate on 
feature representations of the data rather than on dissimilarity 
representations. Developing multi-criteria extensions of such 
non-similarity-based methods is beyond the scope of this paper 
and would be worthwhile future work. 

III. Pareto fronts and their properties 

Multi-criteria optimization and Pareto fronts have been stud¬ 
ied in many application areas in computer science, economics 


and the social sciences. An overview can be found in [23). 
The proposed PDA method in this paper utilizes the notion of 
Pareto optimality, which we now introduce. 

A. Motivation for Pareto optimality 

Consider the following problem: given n items, denoted 
by the set S , and d criteria for evaluating each item, de¬ 
noted by functions f t, f ( i, select x £ S that minimizes 
..., fd{x)]. In most settings, it is not possible to find 
a single item x which simultaneously minimizes fi(x) for all 
i £ {1, Many approaches to the multi-criteria opti¬ 

mization problem reduce to combining all of the criteria into 
a single criterion, a process often referred to as scalarization 
|23|. A common approach is to use a non-negative linear 
combination of the ff s and find the item that minimizes the 
linear combination. Different choices of weights in the linear 
combination yield different minimizers. In this case, one would 
need to identify a set of optimal solutions corresponding to 
different weights using, for example, a grid search over the 
weights. 

A more robust and powerful approach involves identifying 
the set of Pareto-optimal items. An item x is said to strictly 
dominate another item x* if x is no greater than x* in each 
criterion and x is less than x* in at least one criterion. This 
relation can be written as x >- x* if fi(x) < fi(x*) for each 
i and fi(x) < fi(x*) for some i. The set of Pareto-optimal 
items, called the Pareto front, is the set of items in S that 
are not strictly dominated by another item in S. It contains 
all of the minimizers that are found using non-negative linear 
combinations, but also includes other items that cannot be 
found by linear combinations. Denote the Pareto front by 
T\, which we call the first Pareto front. The second Pareto 
front can be constructed by finding items that are not strictly 
dominated by any of the remaining items, which are members 
of the set S\T i. More generally, define the vth Pareto front 
by 

(i-i 

T, = Pareto front of the set S \ 

V=i 

For convenience, we say that a Pareto front T, is deeper than 
Tj if i > j. 



B. Mathematical properties of Pareto fronts 

The distribution of the number of points on the first Pareto 
front was first studied by Bamdorff-Nielsen and Sobel [24) . 
The problem has garnered much attention since. Bai et al. [ |25| 
and Hwang and Tsai [26) provide good surveys of recent 
results. We will be concerned here with properties of the first 
Pareto front that are relevant to the PDA anomaly detection 
algorithm and have not yet been considered in the literature. 

Let Y \,..., Y n be independent and identically dis¬ 
tributed (i.i.d.) on with density function / : —> R, 

and let T" denote the first Pareto front of Y\..... Y n . In 
the general multi-criteria optimization framework, the points 
Y-f,..., Y n are the images in of n feasible solutions to some 
optimization problem under a vector of objective functions of 
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length d. In the context of multi-criteria anomaly detection, 
each point Y, is a dyad corresponding to dissimilarities be¬ 
tween two data samples under multiple criteria, and d = K is 
the number of criteria. 

A common approach in multi-objective optimization is lin¬ 
ear scalarization |23), which constmcts a new single criterion 
as a non-negative linear combination of the d criteria. It is 
well-known, and easy to see, that linear scalarization will only 
identify Pareto-optimal points on the boundary of the convex 
hull of 

g n := U (x + R d + ), 

X^iJ 771 


where R+ = {x £ R d | Vi, x i: > 0}. Although this is 
a common motivation for Pareto optimization methods, there 
are, to the best of our knowledge, no results in the literature 
regarding how many points on the Pareto front are missed by 
scalarization. We present such a result in this section, namely 


the Scalarization Gap Theorem 


We define 


|^J argmin < 

y, aiXi > 

, S n = {Y 1 ,.. 

■,Y n } 

.eRt * eS " 

U=i J 




The subset O n C T" contains all Pareto-optimal points that 
can be obtained by some selection of of non-negative weights 
for linear scalarization. Let I\ n denote the cardinality of T", 
and let L n denote the cardinality of O". When Yj,..., Y n are 
uniformly distributed on the unit hypercube, Barndorff-Nielsen 
and Sobel (24) showed that 

E ( R n) = J Q (1 — x) n ~ 1 (—\ogx) d ~ 1 dx, 

from which one can easily obtain the asymptotics 

E(Kn) = + 0 ((logn) d - 2 ). 


Many more recent works have studied the variance of K n 
and have proven central limit theorems for K n . All of these 
works assume that Yi,..., Y n are uniformly distributed on 


[0,1]“. For a summary, see 1251 and |26J. Other works have 
studied K n for more general distributions on domains that 
have smooth “non-horizontal” boundaries near the Pareto front 
j27) and for multivariate normal distributions on W l |28|. The 
“non-horizontal” condition excludes hypercubes. 

To the best of our knowledge there are no results on the 
asymptotics of K n for non-uniformly distributed points on the 
unit hypercube. This is of great importance as it is impractical 
in multi-criteria optimization (or anomaly detection) to assume 
that the coordinates of the points are independent. Typically 
the coordinates of Yj £ R rf are the images of the same feasible 
solution under several different criteria, which will not in 
general be independent. 

Here we develop results on the size of the gap between the 
number of items L n discoverable by scalarization compared to 
the number of items K n discovered on the Pareto front. The 
larger the gap, the more suboptimal scalarization is relative to 
Pareto optimization. Since x £ O" if and only if x is on the 
boundary of the convex hull of Q n , the size of O' is related 


to the convexity (or lack thereof) of the Pareto front. There 
are several ways in which the Pareto front can be non-convex. 

First, suppose that Yi,..., Y n are distributed on some 
domain 12 C R d with a continuous density function / : 12 —»• R 
that is strictly positive on V.. Let T C <91 1 be a portion of the 
boundary of 12 such that 

inf min(z/i (z),... ,v d (z)) >0, 

and 


{y £ 12 : Vi t/j < x.i} = {a;}, for all tGf, 

where v : <912 —> R d is the unit inward normal to <90. The 
conditions on T guarantee that a portion of the first Pareto 
front will concentrate near T as n —>■ 00. If we suppose that 
T is contained in the interior of the convex hull of O, then 
points on the portion of the Pareto front near T cannot be 
obtained by linear scalarization, as they are on a non-convex 
portion of the front. Such non-convexities are a direct result of 
the geometry of the domain 12 and are depicted in Fig. [2a] In a 
preliminary version of this work, we studied the expectation of 
the number of points on the Pareto front within a neighborhood 
of T (Theorem 1 in 0 )- As a result, we showed that 

E{K n - L n ) > 7 n— + 0(n~), 




f{z) * ( 17 {z)---v d {z)) d dz. 


It has recently come to our attention that a stronger result 
was proven previously by Baryshnikov and Yukich in an 
unpublished manuscript. 

In practice, it is unlikely that one would have enough 
information about / or 12 to compute the constant 7. In this 
paper, we instead study a second type of non-convexity in 
the Pareto front. These non-convexities are strictly due to 
randomness in the positions of the points and occur even when 


the domain 12 is convex (see Fig. 2b for a depiction of such 
non-convexities). In the following, we assume that Y \,..., Y n 
are i.i.d. on the unit hypercube [ 0 , l] d with a bounded density 
function / : [ 0 , l\ d —> R d which is continuous at the origin 
and strictly positive on [0, l] d . Under these assumptions on 
/, it turns out that the asymptotics of E(K n ) and E(L n ) 
are independent of /. Hence our results are applicable to a 
wide range of problems without the need to know detailed 
information about the density /. 

Our first result provides asymptotics on /(',, , the size of the 
first Pareto front. 


Theorem 1. Assume f : [0, l] d —» [a, M] is continuous at the 
origin, and 0 < a < M < 00. Then 


E(Kn') ~~ Cn.rf 


(logn) d 1 

(d- 1 )! 


as n-> 00 . 


The proof of Theorem |T] is provided in the Appendix. Our 
second result concerns E(L n ). We are not able to get the 
exact asymptotics of E{L n ), so we provide upper and lower 
asymptotic bounds. 
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(b) 


Fig. 2. |(a)| Non-convexities in the Pareto front induced by the geometry of 
the domain O. |(b)| Non-convexities due to randomness in the points. In each 
case, the larger points are Pareto-optimal, and the large black points cannot 
be obtained by scalarization. 

Theorem 2. Assume f : [0, l] rf —> [cj, M] is continuous at the 
origin, and 0 < a < M < oo. Then 

|rCn,d + o((logn) d_1 ) < E(L n ) < §gE^c n , d + o((logn) d_1 ) 
as n —> oo. 

Theorem [2] provides a significant generalization of a pre¬ 
vious result (Theorem 2 in Q) that holds only for uniform 
distributions in d = 2. The proof of Theorem|2]is also provided 
in the Appendix. 

Combining Theorems |T| and [2] we arrive at our main result: 

Theorem (Scalarization Gap Theorem). Assume f : [0, 1]' ( —>■ 
[er, TIT] is continuous at the origin, and 0 < cr < M < oo. 
Then 


TABLE I 

List of notation 


Symbol Definition 

K Number of criteria (dissimilarity measures) 

N Number of training samples 

Xj zth training sample 

X Single test sample 

di (z. j) Dissimilarity between training samples X, and Xj using Zth 
criterion 

Dij Dyad between training samples X, and Xj 

V Set of all dyads between training samples 

Ti Pareto front i of dyads between training samples 

M Total number of Pareto fronts on dyads between training 

samples 

Dt Dyad between training sample X , and test sample X 

d Pareto depth of dyad V), between training sample A', and 

test sample X 

ki Number of nearest neighbors in criterion l 

s Total number of nearest neighbors (over all criteria) 

v {X ) Anomaly score of test sample X 


data samples is available. Given a test sample X, the objective 
of anomaly detection is to declare X to be an anomaly if X 
is significantly different from samples in X N . Suppose that 
I\ > 1 different evaluation criteria are given. Each criterion 
is associated with a measure for computing dissimilarities. 
Denote the dissimilarity between Xj and Xj computed using 
the dissimilarity measure corresponding to the /th criterion by 
di(i,j). Note that need not be a metric; in particular 

it is not necessary that di(i,j) be a distance function over the 
sample space or that di(i,j) satisfy the triangle inequality. 

We define a dyad between a pair of samples i £ 
{1,..., N} and j £ {1,..., N} \ i by a vector D, 3 = 
[di(i, j ),..., dx (*, j)] T £ R+- There are in total ('^) dif¬ 
ferent dyads for the training set. For convenience, denote the 
set of all dyads by V. By the definition of strict dominance 
in Section [HI] a dyad D. rj strictly dominates another dyad 
D t *j* if di(i,j) < di(i*,j*) for all l £ {1 , ...,/V} and 
di(i,j) < di(i*,j*) for some l. The first Pareto front T\ 
corresponds to the set of dyads from V that are not strictly 
dominated by any other dyads from V. The second Pareto 
front J ~2 corresponds to the set of dyads from V\F\ that are 
not strictly dominated by any other dyads from T>\ T±, and 
so on, as defined in Section [HI] Recall that we refer to T, as 
a deeper front than Tj if i > j. 


d -1 
Ad—2 


d-l\ 


c n ,d + o((log n) 

< E(I< n - L n ) < (1 - §i) c nA + o((logn) d_1 ), 


The |Scalarization Gap Theorem shows that the fraction of 
Pareto-optimal points that cannot be obtained by linear scalar¬ 
ization is at least ‘V 1 , . We provide experimental evidence 


supporting these bounds in Section V-A 


IV. Multi-criteria anomaly detection 

We now formally define the multi-criteria anomaly detection 
problem. A list of notation is provided in Table [I] for reference. 
Assume that a training set Xjq = {X \,..., ATy} of unlabeled 


A. Pareto fronts on dyads 

For each training sample X,, there are N — 1 dyads 
corresponding to its connections with the other N — 1 training 
samples. If most of these dyads are located at shallow Pareto 
fronts, then the dissimilarities between Xj and the other TV — 1 
training samples are small under some combination of the 
criteria. Thus, Xj is likely to be a nominal sample. This is 
the basic idea of the proposed multi-criteria anomaly detection 
method using PDA. 

We construct Pareto fronts E\,. Em of the dyads from 
the training set, where the total number of fronts M is the 
required number of fronts such that each dyad is a member of a 
front. When a test sample X is obtained, we create new dyads 
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corresponding to connections between X and training samples, 
as illustrated in Fig. [I] Like with many other similarity-based 
anomaly detection methods, we connect each test sample to 
its k nearest neighbors, k could be different for each criterion, 
so we denote ki as the choice of k for criterion l. We create 
s = Y^i =1 ki new dy a ds Di, -D 2 ,..., D s , corresponding to 
the connections between X and the union of the ki nearest 
neighbors in each criterion l. In other words, we create a dyad 
between test sample X and training sample X, if X, is among 
the ki nearest neighbor^] of X in any criterion l. We say that 
Di is below a front Tj if Di >- D for some D £ Tj, i.e. Di 
strictly dominates at least a single dyad in Tj. Define the 
Pareto depth of I), by 

ei = min{j | Di is below Tj}. 

Therefore if e, is large, then I), will be near deep fronts, 
and the distance between X and X, will be large under all 
combinations of the K criteria. If e,; is small, then I), will be 
near shallow fronts, so the distance between X' and X, will 
be small under some combination of the K criteria. 


Training phase: 

1: for l = 1 —» K do 

2 : Calculate pairwise dissimilarities di{i,j) between all 

training samples X, and X 3 

3: Create dyads Dij = \d\(i, j),... ,dK{i,j )] for all training 
samples 

4: Construct Pareto fronts on set of all dyads until each dyad 
is in a front 
Testing phase: 

1: nb 4- [] {empty list} 

2: for l = 1 —» K do 

3: Calculate dissimilarities between test sample X' and all 

training samples in criterion l 
4: nbi 4 — ki nearest neighbors of X 

5: nb 4— [nb,nbi\ {append neighbors to list} 

6 : Create s = Yli -1 ki new dyads Di between X and 
training samples in nb 
7: for i = 1 —>• s do 
8 : Calculate depth e* of Di 

9: Declare X an anomaly if v(X) = (1/s) Ylt= 1 e i> P 


B. Anomaly detection using Pareto depths 

In fc-NN based anomaly detection algorithms such as those 
mentioned in Section |II-B[ the anomaly score is a function 
of the k nearest neighbors to a test sample. With multiple 
criteria, one could define an anomaly score by scalarization. 
From the probabilistic properties of Pareto fronts discussed 
in Section III-B| we know that Pareto optimization methods 
identify more Pareto-optimal points than linear scalarization 
methods and significantly more Pareto-optimal points than a 
single weight for scalarizatiorjf] 

This motivates us to develop a multi-criteria anomaly score 
using Pareto fronts. We start with the observation from Fig. |T] 
that dyads corresponding to a nominal test sample are typically 
located near shallower fronts than dyads corresponding to an 
anomalous test sample. Each test sample is associated with 
s = YI 1 L 1 ki new dyads, where the ith dyad l), has depth e t . 
The Pareto depth ei is a multi-criteria dissimilarity measure 
that indicates the dissimilarity between the test sample and 
training sample i under multiple combinations of the criteria. 
For each test sample X, we define the anomaly score v(X) 
to be the mean of the a’s, which corresponds to the average 
depth of the s dyads associated with X, or equivalently, the 
average of the multi-criteria dissimilarities between the test 
sample and its s nearest neighbors. Thus the anomaly score 
can be easily computed and compared to a decision threshold 
p using the test 


v(X) = ~ e ‘ 


H 1 

^ (>■ 
Ho 


J If Xi is one of the ki nearest neighbors of X in multiple criteria, then 
multiple copies of the dyad Di are created. We have also experimented with 
creating only a single copy of the dyad and found very little difference in 
detection accuracy. 

2 Theorems^ and E] require i.i.d. samples, but dyads are not independent. 
However, there are 0[N 2 ) dyads, and each dyad is only dependent on O(JV) 
other dyads. This suggests that the theorems should also hold for the non- 
i.i.d. dyads as well, and it is supported by experimental results presented in 
Section |V-A| 


Fig. 3. Pseudocode for PDA anomaly detection algorithm. 


Recall that the |Scalarization Gap Theorem| provides bounds 
on the fraction of dyads on the first Pareto front that cannot 
be obtained by linear scalarization. Specifically, at least 
dyads will be missed by linear scalarization on average. 
These dyads will be associated with deeper fronts by linear 
scalarization, which will artificially inflate the anomaly score 
for the test sample, resulting in an increased false positive 
rate for any algorithm that utilizes non-negative linear com¬ 
binations of criteria. This effect then cascades to dyads in 
deeper Pareto fronts, which also get assigned inflated anomaly 
scores. We provide some evidence of this effect on a real data 
experiment in Section [V-C| Finally, the lower bound increases 
monotonically in K , which implies that the PDA approach 
gains additional advantages over linear combinations as the 
number of criteria increases. 

C. PDA anomaly detection algorithm 

Pseudocode for the PDA anomaly detector is shown in 
Fig. 0 The training phase involves creating ((f) dyads cor¬ 
responding to all pairs of training samples. Computing all 
pairwise dissimilarities in each criterion requires 0(mKN 2 ) 
floating-point operations (flops), where m denotes the number 
of dimensions involved in computing a dissimilarity. The 
time complexity of the training phase is dominated by the 
construction of the Pareto fronts by non-dominated sorting. 
Non-dominated sorting is used heavily by the evolutionary 
computing community; to the best of our knowledge, the 
fastest algorithm for non-dominated sorting was proposed 
by Jensen (29J and later generalized by Fortin et al. (30} 
and utilizes 0(N 2 log A_ 1 (X 2 )) comparisons. The complexity 
analyses in (29), (30} are asymptotic in N and assume K 
fixed. We are unaware of any analyses of its asymptotics in 
I \. Another non-dominated sorting algorithm proposed by Deb 
et al. [1311 constructs all of the Pareto fronts using Q(KN 4 ) 
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comparisons, which is linear in the number of criteria K 
but scales poorly with the number of training samples N. 
We evaluate how both approaches scale with K and N 


experimentally in Section V-B2 


The testing phase involves creating dyads between the test 
sample and the ki nearest training samples in criterion Z, 
which requires 0(mKN ) flops. For each dyad Di, we need 
to calculate the depth e*. This involves comparing the test 
dyad with training dyads on multiple fronts until we find a 
training dyad that is dominated by the test dyad, e* is the 
front that this training dyad is a part of. Using a binary 
search to select the front and another binary search to select 
the training dyads within the front to compare to, we need 
to make 0(K log 2 N) comparisons (in the worst case) to 
compute e,. The anomaly score is computed by taking the 
mean of the s ef s corresponding to the test sample; the score 
is then compared against a threshold p to determine whether 
the sample is anomalous. 


D. Selection of parameters 

The parameters to be selected in PDA are k\, ..., kx , which 
denote the number of nearest neighbors in each criterion. 

The selection of such parameters in unsupervised learning 
problems is very difficult in general. For each criterion l, we 
construct a fc/-NN graph using the corresponding dissimilarity 
measure. We construct symmetric ki- NN graphs, i.e. we con¬ 
nect samples i and j if i is one of the k/ nearest neighbors 
of j or j is one of the ki nearest neighbors of i. We choose 
ki = logN as a starting point and, if necessary, increase ki 
until the /;:/-NN graph is connected. This method of choosing 
ki is motivated by asymptotic results for connectivity in fc-NN 
graphs and has been used as a heuristic in other unsupervised 
learning problems, such as spectral clustering [ [32J . We find 
that this heuristic works well in practice, including on a real 
data set of pedestrian trajectories, which we present in Section 

Ec] 


V. Experiments 

We first present an experiment involving the scalarization 
gap for dyads (rather than Lid. samples). Then we compare 
the PDA method with five single-criterion anomaly detection 
algorithms on a simulated data set and a real data se0 The 
five algorithms we use for comparison are as follows: 

• kNN: distance to the fcth nearest neighbor ||4j|. 

• kNN sum: sum of the distances to the k nearest neighbors 

®, ©• 

• k-LPE: localized p-value estimate using the k nearest 
neighbors 0. 

• LOF: local density of the k nearest neighbors [[7). 

• 1-SVM: 1-class support vector machine [20]]. 

For these methods, we use linear combinations of the 
criteria with different weights (linear scalarization) to compare 
performance with the proposed multi-criteria PDA method. We 
find that the accuracies of the nearest neighbor-based methods 

3 The code for the experiments is available at http://tbayes.eecs.umich.edu/ 
coolmark/pda 



Fig. 4. Sample means for K n — L n versus number of dyads for dimension 
d = 2. Note the expected logarithmic growth. The dotted line indicates the 
curve of best fit y = 0.314 log n. 



Fig. 5. Sample means for {K n — L n )/c n ^d versus dimensio n for n = 100,128 
dyads. The upper and lower bounds established in the |Scalarization Gap] 
|Theoremj are given by the dotted lines in the figure. We see in the figure 
that the fraction of Pareto optimal points that are not obtainable by linear 
scalarization increases with dimension. 


do not vary much in our experiments for k = 3,... ,10. The 
results we report use k = 6 neighbors. For the 1-class SVM, 
it is difficult to choose a bandwidth for the Gaussian kernel 
without having labeled anomalous samples. Linear kernels 
have been found to perform similarly to Gaussian kernels 
on dissimilarity representations for SVMs in classification 
tasks (33); hence we use a linear kernel on the scalarized 
dissimilarities for the 1-class SVM. 


A. Scalarization gap for dyads 

Independence of Y-[,... ,Y rl is built into the assumptions of 
Theorems [T] and [2] and thus, the Scalarization Gap Theorem 


but it is clear that dyads (as constructed in Section IVI are not 
independent. Each dyad represents a connection between 
two independent training samples Xi and Xj. For a given dyad 
D,,j, there are 2{N — 2) corresponding dyads involving Xi or 
Xj, and these are clearly not independent from D,j . However, 
all other dyads are independent from D i:l . So while there 
are 0 {N 2 ) dyads, each dyad is independent from all other 


dyads except for a set of size 0{N). Since the Scalarization 


Gap Theorem is an asymptotic result, the above observation 


suggests it should hold for the dyads even though they are 
not i.i.d. In this subsection we present some experimental 




















results which suggest that the |Scalarization Gap Theorem| does 
indeed hold for dyads. 

We generate synthetic dyads here by drawing i.i.d. uniform 
samples in [ 0 , l ] 2 and then constructing dyads corresponding 
to the two criteria |Ax| and |Aj/|, which denote the absolute 
differences between the x and y coordinates, respectively. The 
domain of the resulting dyads is again the box [0, l] 2 . In this 
case, the Scalarization Gap Theorem suggests that E(K n —L n ) 
should grow logarithmically. Fig. [4 shows the sample means 
of K n — L n versus number of dyads and a best fit logarithmic 
curve of the form y = a log n, where n = (^) denotes the 
number of dyads. We vary the number of dyads between 10 6 
to 10 9 in increments of 10 6 and compute the size of K n — 
L n after each increment. We compute the sample means over 
1,000 realizations. A linear regression on y/ log n versus log n 
gives a = 0.314, which falls in the range specified by the 


Scalarization Gap Theorem 

We next explore the dependence of K n — L n on the 
dimension d. Here, we generate 100,128 dyads (corresponding 
to N = 448 points in [0, l] d ) in the same way as before, for 
dimensions d = 2,... ,7. The criteria in this case correspond 
to the absolute differences in each dimension. In Fig. [5] we plot 
E(K n — L n )/c n ^d versus dimension to show the fraction of 
Pareto-optimal points that cannot be obtained by scalarization. 
Recall from Theorem Q] that 


E(Kn) 


(logn) d 1 

(d- 1 )! 


as n —> oo. 


Based on the figure, one might conjecture that the fraction of 
unattainable Pareto optimal points converges to 1 as d —> oo. If 
this is true, it would essentially imply that linear scalarization 
is useless for identifying dyads on the first Pareto front when 
there are a large number of criteria. As before, we compute 
the sample means over 1,000 realizations of the experiment. 


B. Simulated experiment with categorical attributes 

In this experiment, we perform multi-criteria anomaly de¬ 
tection on simulated data with multiple groups of categorical 
attributes. These groups could represent different types of 
attributes. Each data sample consists of I\ groups of 20 
categorical attributes. Let A i: j denote the jth attribute in group 
i, and let n vi denote the number of possible values for this 
attribute. We randomly select between 6 and 10 possible values 
for each attribute with equal probability independent of all 
other attributes. Each attribute is a random variable described 
by a categorical distribution, where the parameters qi, ■ ■ ■ ,q ni 
of the categorical distribution are sampled from a Dirichlet 
distribution with parameters a%,... , a ni . For a nominal data 
sample, we set a\ = 5 and a 2 , • • •, ot-m = 1 for each attribute 
j in each group i. 

To simulate an anomalous data sample, we randomly select 
a group i with probability p, for which the parameters of the 
Dirichlet distribution are changed to ot\ = ■ ■ ■ = a ni . = 1 
for each attribute j in group i. Note that different anomalous 
samples may differ in the group that is selected. The p,’s are 
chosen such that Pi/pj = i/j with Pi = 0.5, so that 
the probability that a test sample is anomalous is 0.5. The 
non-uniform distribution on the pd s results in some criteria 



Fig. 6. AUC of PDA compared to AUCs of single-criterion methods for 
simulated experiment. The single-criterion methods use 600 randomly sam¬ 
pled weights for linear scalarization, with weights ordered from worst choice 
of weights (left) to best choice (right) in terms of maximizing AUC. The 
proposed PDA algorithm is a multi-criteria algorithm that does not require 
selecting weights. PDA outperforms all of the single-criterion methods, even 
for the best choice of weights, which is not known in advance. 


TABLE II 

Comparison of AUCs for simulated experiment. Best performer 
UP TO one standard error is shown in bold. PDA does not use 

WEIGHTS SO IT HAS A SINGLE AUC. MEDIAN AND BEST AUCS OVER ALL 
CHOICES OF WEIGHTS ARE SHOWN FOR THE OTHER METHODS. 


AUC by weight 
Median Best 


PDA 



0.885 ± 0.002 



k-NN 

0.749 

± 

0.002 

0.872 

± 

0.002 

k-NN sum 

0.747 

± 

0.002 

0.870 

± 

0.002 

k-LPE 

0.744 

± 

0.002 

0.867 

± 

0.002 

LOF 

0.749 

± 

0.002 

0.859 

± 

0.002 

1-SVM 

0.757 

± 

0.002 

0.873 

± 

0.002 


being more useful than others for identifying anomalies. 
The K criteria for anomaly detection are taken to be the 
dissimilarities between data samples for each of the K groups 
of attributes. For each group, we calculate the dissimilarity 
over the attributes using a dissimilarity measure for anomaly 
detection on categorical data proposed in m 

We draw 400 training samples from the nominal distribution 
and 400 test samples from a mixture of the nominal and 
anomalous distributions. For the single-criterion algorithms, 
which we use as baselines for comparison, we use linear 
scalarization with multiple choices of weights. Since a grid 
search scales exponentially with the number of criteria K 
and is computationally intractable even for moderate values 
of K, we instead uniformly sample 100 A weights from the 
(K — l)-dimensional simplex. In other words, we sample 
100/0 weights from a uniform distribution over all convex 
combinations of K criteria. 

1) Detection accuracy: The different methods are evaluated 
using the receiver operating characteristic (ROC) curve and 
the area under the ROC curve (AUC). We first fix the number 
of criteria K to be 6 . The mean AUCs over 100 simulation 
runs are shown in Fig. [ 6 ] Multiple choices of weights are 

4 We obtain similar results with several other dissimilarity measures for 
categorical data, including the Goodall2 and IOF measures described in the 
survey paper by Boriah et al. [34[. 
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Fig. 7. The ratio of the AUC for PDA compared to the best and median 
AUCs of scalarization using LOF as the number of criteria K is varied in 
the simulated experiment. lOOif choices of weights uniformly sampled from 
the (K — 1)-dimensional simplex are chosen for scalarization. PDA perfoms 
significantly better than the median over all weights for all K. For K > 4, 
PDA outperforms the best weights for scalarization, and the margin increases 
as K increases. 


used for linear scalarization for the single-criterion algorithms; 
the results are ordered from worst to best weight in terms 
of maximizing AUC. kNN, kNN sum, and k-LPE perform 
roughly equally so only kNN is shown in the figure. Table [11] 
presents a comparison of the AUC for PDA with the median 
and best AUCs over all choices of weights for scalarization. 
Both the mean and standard error of the AUCs over the 
100 simulation runs are shown. Notice that PDA outperforms 
even the best weighted combination for each of the five 
single-criterion algorithms and significantly outperforms the 
combination resulting in the median AUC, which is more 
representative of the performance one expects to obtain by 
arbitrarily choosing weights. 

Next we investigate the performance gap between PDA and 
scalarization as the number of criteria K varies from 2 to 
10. The performance of the five single-criterion algorithms is 
very close, so we show scalarization results only for LOF. 
The ratio of the AUC for PDA to the AUCs of the best and 
median weights for scalarization are shown in Fig. [7] PDA 
offers a significant improvement compared to the median over 
the weights for scalarization. For small values of K, PDA 
performs roughly equally with scalarization under the best 
choice of weights. As K increases, however, PDA clearly 
outperforms scalarization, and the gap grows with K. We be¬ 
lieve this is partially due to the inadequacy of scalarization for 
identifying Pareto fronts as described in the |Scalarization Gap| 


Theorem and partially due to the difficulty in selecting optimal 


weights for the criteria. A grid search may be able to reveal 
better weights for scalarization, but it is also computationally 
intractable for large K. Thus we conclude that PDA is clearly 
the superior approach for large K. 

2) Computation time: We evaluate how the computation 
time of PDA scales with varying K and TV using both the non- 
dominated sorting procedures of Fortin et al. f30| (denoted 
by PDA-Fortin) and Deb et al. J3TJ (denoted by PDA-Deb) 
discussed in Section IV-C The time complexity of the testing 
phase is negligible compared to the training phase so we 



Number of training samples 


+ PDA-Deb 

— PDA-Deb best fit (a = 4.6) 
PDA-Fortin 

PDA-Fortin best fit (a = 2.2) 
□ k-LPE 

-k-LPE best fit (a = 1.8) 

O 1-SVM 

1-SVM best fit (a = 2.8) 


Fig. 8. Normalized computation time as a function of the number of training 
samples N in the simulated experiment. Best fit curves are of the form y = 
/3N a . The best fit curve for 1-SVM is extrapolated beyond 5,624 samples, 
and the best fit curve for PDA-Deb is extrapolated beyond 563 samples. 




(a) PDA-Deb (b) PDA-Fortin 

Fig. 9. Normalized computation time as a func tion of the number of criteria 
K in the simulated experiment. PDA-Deb |(a)| appears to be linear in K as 
predicted. PDA-Fortin |(b)| initially appears to be exponential in K but the 
computation time does not continue to increase exponentially beyond K = 7. 


measure the computation time required to train the PDA 
anomaly detector. 

We first fix K = 2 and measure computation time for TV 
uniformly distributed on a log scale from 100 to 10,000. 
Since the actual computation time depends heavily on the 
implementation of the non-dominated sorts, we normalize 
computation times by the time required to train the anomaly 
detector for TV = 100 so we can observe the scaling in TV. 

The normalized times for PDA as well as k-LPE and 
1-SVM are shown in Fig. [8] Best fit curves of the form 
y = /3N a are also plotted, with a and /3 estimated by linear 
regression. PDA-Deb has time complexity of 0(KN 4 ), and 
the estimated exponent a = 4.6. Of the four algorithms, it 
has the worst scaling in TV. PDA-Fortin has time complexity 
of 0(TV 2 log A_1 (iV 2 )), and the estimated exponent a = 2.2, 
confirming that it scales much better than PDA-Deb and is 
applicable to large data sets. k-LPE is representative of the 
k-NN algorithms and has time complexity of 0(TV 2 ); the 
estimated exponent a = 1.8. It is difficult to determine the 
time complexity of 1-SVM due to its iterative nature. The 
estimated exponent for 1-SVM is a = 2.8, suggesting that it 
scales worse than PDA-Fortin. 

Next we fix TV = 400 and measure computation time for 
K varying from 2 to 10. We normalize by the time required 
to train the anomaly detector for K = 2 to observe the 
scaling in K. The normalized time for PDA-Deb is shown 
in Fig. [9a] along with a best fit line of the form y = aK. 
The normalized time does indeed appear to be linear in K 
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(a) (b) 

Fig. 10. |(a)| Some anomalous pedestrian trajectories detected by PDA. |(b)| 
Trajectories with relatively low anomaly scores. The two criteria used are 
walking speed and trajectory shape. Anomalous trajectories could have 
anomalous speeds or shapes (or both), so some anomalous trajectories may 
not look anomalous by shape alone. 


and grows slowly. The normalized time for PDA-Fortin is 
shown in Fig. 9b along with a best fit curve of the form 
y = a/3 K fit to A' = 2,... ,7. Notice that the computation 
time initially increases exponentially but increases at a much 
slower, possibly even linear, rate beyond K = 7. The analyses 

are asymptotic in N and 


of time complexity from |29|, 
assume K fixed; we are not aware of any analyses of time 
complexity asymptotic in K. Our experiments suggest that 
PDA-Fortin is computationally tractable for non-dominated 
sorting in PDA even for large K. Finally we note that the 
scaling in K for scalarization methods is trivial, depending 
simply on the number of choices for scalarization weights, 
which is exponential for a grid search. 



Fig. 11. AUC of PDA compared to AUCs of single-criterion methods for 
the pedestrian trajectories experiment. The single-criterion methods use linear 
scalarization with 100 uniformly spaced weights; weights are ordered from 
worst (left) to best (right) in terms of maximizing AUC. PDA outperforms 
the single-criterion methods for almost all choices of weights. 


TABLE El 

Comparison of AUCs for pedestrian trajectories experiment. 
Best performer up to one standard error is shown in bold. 


AUC by weight 
Median Best 


PDA 



0.944 ± 0.002 



k-NN 

0.902 

± 

0.002 

0.918 

± 

0.002 

k-NN sum 

0.901 

± 

0.003 

0.924 

± 

0.002 

k-LPE 

0.892 

± 

0.003 

0.917 

± 

0.002 

LOF 

0.754 

± 

0.011 

0.952 

± 

0.003 

1-SVM 

0.679 

± 

0.011 

0.910 

± 

0.003 


C. Pedestrian trajectories 

We now present an experiment on a real data set that 
contains thousands of pedestrians’ trajectories in an open area 
monitored by a video camera (55) , We represent a trajectory 
with p time samples by 


Xi x 2 x p 

yi V2 ■■■ iip 


where [x t , yt] denote a pedestrian’s position at time step t. The 
pedestrian trajectories are of different lengths so we cannot 
simply treat the trajectories as vectors in W and calculate 
Euclidean distances between them. Instead, we propose to cal¬ 
culate dissimilarities between trajectories using two separate 
criteria for which trajectories may be dissimilar. 

The first criterion is to compute the dissimilarity in 
walking speed. We compute the instantaneous speed at 
all time steps along each trajectory by finite differencing, 
i.e. the speed of trajectory T at time step t is given by 
y/(x t — x t ~ i) 2 + (yt — yt-i) 2 . A histogram of speeds for 
each trajectory is obtained in this manner. We take the 
dissimilarity between two trajectories S and T to be the 
Kullback-Leibler (K-L) divergence between the normalized 
speed histograms for those trajectories. K-L divergence is 
a commonly used measure of the difference between two 
probability distributions. The K-L divergence is asymmetric; 
to convert it to a dissimilarity we use the symmetrized K- 
L divergence Dkl(S\\T) + Dkl(T\\S) as originally defined 


by Kullback and Leibler |36) . We note that, while the sym¬ 
metrized K-L divergence is a dissimilarity, it does not, in 
general, satisfy the triangle inequality and is not a metric. 

The second criterion is to compute the dissimilarity in 
shape. To calculate the shape dissimilarity between two trajec¬ 
tories, we apply a technique known as dynamic time warping 
(DTW) 137j, which first non-linearly warps the trajectories in 
time to match them in an optimal manner. We then take the 
dissimilarity to be the summed Euclidean distance between the 
warped trajectories. This dissimilarity also does not satisfy the 
triangle inequality in general and is thus not a metric. 

The training set for this experiment consists of 500 ran¬ 
domly sampled trajectories from the data set, a small fraction 
of which may be anomalous. The test set consists of 200 
trajectories (150 nominal and 50 anomalous). The trajectories 
in the test set are labeled as nominal or anomalous by a 
human viewer. These labels are used as ground truth to 
evaluate anomaly detection performance. Fig. [TO] shows some 
anomalous trajectories and nominal trajectories detected using 
PDA. 


We run the experiment 20 times; for each run, we use a 
different random sample of training trajectories. Fig. 11 shows 
the performance of PDA as compared to the other algorithms 
using 100 uniformly spaced weights for convex combinations. 
Notice that PDA has higher AUC than the other methods for 
almost all choices of weights for the two criteria. The AUC for 
PDA is shown in Table [TIT] along with AUCs for the median and 
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False positive rate 



Fig. 12. ROC curves for PDA and attainable region for LOF over 100 choices 
of weights for one run of the pedestrian trajectories experiment. The attainable 
region denotes the possible ROC curves for LOF corresponding to different 
choices of weights for linear scalarization. The ROCs for linear scalarization 
vary greatly as a function of the weights. 


Fig. 14. AUCs for different choices of [fci,fc 2 ] in the pedestrian trajectories 
experiment. The AUC for the parameters chosen using the proposed heuristic 
[ki = 6, &2 — 7] is within 0.001 of the maximum AUC obtained by the 
parameters [k± = 5, &2 =4]. 



Fig. 13. Comparison of a Pareto front (solid red line) on dyads (gray dots) 
with convex fronts (blue dashed lines) obtained by linear scalarization. The 
dyads towards the middle of the Pareto front are found in deeper convex fronts 
than those towards the edges. The result would be inflated anomaly scores 
for the samples associated with the dyads in the middle of the Pareto fronts 
when using linear scalarization. 


best choices of weights for the single-criterion methods. The 
mean and standard error over the 20 runs are shown. For the 
best choice of weights, LOF is the single-criterion method with 
the highest AUC, but it also has the lowest AUC for the worst 
choice of weights. For a more detailed comparison, the ROC 
curve for PDA and the attainable region for LOF (the region 
between the ROC curves corresponding to weights resulting 
in the best and worst AUCs) is shown in Fig. [12] Note that 
the ROC curve for LOF can vary significantly based on the 
choice of weights. The ROC for 1-SVM also depends heavily 
on the weights. In the unsupervised setting, it is unlikely that 
one would be able to achieve the ROC curve corresponding to 
the weight with the highest AUC, so the expected performance 
should be closer to the median AUCs in Table [HI] 

Many of the Pareto fronts on the dyads are non-convex, 
partially explaining the superior performance of the proposed 
PDA algorithm. The non-convexities in the Pareto fronts 
lead to inflated anomaly scores for linear scalarization. A 
comparison of a Pareto front with two convex fronts (obtained 


by scalarization) is shown in Fig. 13 The two convex fronts 
denote the shallowest and deepest convex fronts containing 
dyads on the illustrated Pareto front. The test samples asso¬ 
ciated with dyads near the middle of the Pareto fronts would 
suffer the aforementioned score inflation, as they would be 
found in deeper convex fronts than those at the tails. 

Finally we note that the proposed PDA algorithm does not 
appear to be very sensitive to the choices of the number of 
neighbors, as shown in Fig. fl4| In fact, the heuristic proposed 
for choosing the fc;’s in Section IV-D performs quite well in 
this experiment. Specifically, the AUC obtained when using 
the parameters chosen by the proposed heuristic is very close 
to the maximum AUC over all choices of the number of 
neighbors [fci,fc 2 ]- 


VI. Conclusion 

In this paper we proposed a method for similarity-based 
anomaly detection using a novel multi-criteria dissimilarity 
measure, the Pareto depth. The proposed method utilizes the 
notion of Pareto optimality to detect anomalies under multiple 
criteria by examining the Pareto depths of dyads corresponding 
to a test sample. Dyads corresponding to an anomalous sample 
tended to be located at deeper fronts compared to dyads corre¬ 
sponding to a nominal sample. Instead of choosing a specific 
weighting or performing a grid search on the weights for 
dissimilarity measures corresponding to different criteria, the 
proposed method can efficiently detect anomalies in a manner 
that is tractable for a large number of criteria. Furthermore, the 
proposed Pareto depth analysis (PDA) approach is provably 
better than using linear combinations of criteria. Numerical 
studies validated our theoretical predictions of PDA’s perfor¬ 
mance advantages compared to using linear combinations on 
simulated and real data. 

An interesting avenue for future work is to extend the PDA 
approach to extremely large data sets using approximate, rather 
than exact, Pareto fronts. In addition to the skyline algo¬ 
rithms from the information retrieval community that focus 
on approximating the first Pareto front, there has been recent 
work on approximating Pareto fronts using partial differential 
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equations |38]| that may be applicable to the multi-criteria 
anomaly detection problem. 

Appendix 

The proofs for Theorems [I] and [2] are presented after some 
preliminary results. We begin with a general result on the 
expectation of K n . Let F : [0, l\ d —> R denote the cumulative 
distribution function of /, defined by 

rx i nx d 

F(x)= •••/ f(yi,... ,yd) dyi ■ ■ ■ dy d . 

Jo Jo 

Proposition 1. For any n > 1 we have 

E{K n ) = n f /(ir) (1 - F{x)) n ~ l dx. 

J[ o,t] d 

Proof. Let F, be the event that Y t £ T" and let Xe, be 
indicator random variables for Ei. Then 

( n \ n 

=E p (^) = nP ^)- 

Conditioning on Y\ we obtain 

E(K n ) = n [ f(x)P{E 1 | Yi = x)dx. 

J\o.i\ d 


Noting that P(E\ \ Y\ = x) = (1 — F(x)) n 1 completes the 
proof. □ 

The following simple proposition is essential in the proofs 
of Theorem Q] and [ 2 ] 

Proposition 2. Let 0 < S < 1 and a > 0. For a < 5~ d we 
have 

n ( (1 — ax\ ■ ■ ■ Xd) n ~ l dx = 0 ((logn) d ^ 2 ), CD 

J[o,S] d a 

and for a < 1 we have 

n f (1 — axi ■ ■ ■ Xd) n ~ l dx = 0 ((logn) d_2 ). ( 2 ) 

J [0,l]' J \[0,(5] d 

Proof We will give a sketch of the proof as similar results 
are well-known [ 251. Assume <5=1 and let Q n denote the 
quantity on the left hand side of <|TJ. Making the change of 
variables yi = Xi for i = 1,..., d — 1 and t = x i • • • x d , we 
see that 


1 /*! r i 


Qn — El 


/0 Jt 


(1 — at) 


n—1 


! _*_ yi---yd-i 

yz-vd-i 


dy i • • • dy d -idt. 


By computing the inner d — 1 integrals we find that 


Qn — 


n 


- j (-log t) d 1 (1 — at) n l dt , 


We can now apply the above result provided aS d < 1. The 
asymptotics in 0 show that 

n / (1 — axi ■■■Xd) n ~ 1 dx 

J[o,i] d 

= n f (1 — ax\ ■ ■ ■ x d ) n ~ 1 dx + 0({}ogn) d ~ 2 ), 
J[o,s] d 

when a < 1 , which gives the second result | 2 ]). □ 

We now give the proof of Theorem |T] 

Proof Let e > 0 and choose <5 > 0 such that 

/(o) - £ < f(x) < /( 0 ) + e for any x £ [ 0 , <5] d , 

and /(0) < S~ d . Since a < f < M. we have that F(x) > 
ox± ■ ■ ■ x d for all x £ [0, l] d . Since / is a probability density 
on [0, l] d , we must have er < 1. Since a > 0, we can apply 
Proposition [2] to find that 

n f f(x)(l — F(x)) n ~ 1 dx 

J [0,l] d \[0,5] d 

< Mn / (1 — axi ■ ■ ■ Xd) 11 - 1 dx 

J [0,1] d \[0,5] d 

= 0((logn) d - 2 ). (3) 

For x £ [0,<5] d , we have 

(/(0) - e)xi ■■■x d < F(x) < (/(0) + e)xi ■■■x d . 

Combining this with Proposition [2] and the fact that /(())—£ < 
S~ d we have 

n f f(x)(l — F(x)) n ~ 1 dx 
J[Q,S] d 

< (/( 0 ) + e)n [ (1 - (/( 0 ) - e)x\ ■ ■ ■ a: d ) n ~ 1 dx 


m+e 


c n , d + 0((\ogn) d 2 ). 


m-e 

Combining ([3| and ([4]) with Proposition (|T]» we have 

E[K n ) < ■ c n , d + 0((log n) d ~ 2 ). 


(4) 


/(0) - e 


It follows that 


limsup c n \E{K n ) < 


(d-l)'.jo 

from which the asymptotics 0 can be easily obtained by 
another change of variables u = nat, provided a < 1. For 
0 < 6 < 1 , we make the change of variables y = x/S to find 
that 

Qn = 5 d n / (1 - a8 d yi ■ ■ ■ y d ) n_1 dy. 

i[o,i] d 


/(0) +e 

n—► oo ■ /(O)-e' 

By a similar argument we can obtain 

‘“S' C n4 E (En) > FFf- 

Since £ > 0 was arbitrary, we see that 

lim c n 1 d E{K n ) = 1- 


□ 


The proof of Theorem [2] is split into the following two 
lemmas. It is well-known, and easy to see, that x £ C n if 
and only if x £ T n , and x is on the boundary of the convex 
hull of Q n |23) . This fact will be used in the proof of Lemma 

ffl 
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' [0,1] d \[0,<5] d 


<n — F(x)) n 1 dx 

J [0,1] d \[0,<5] d 

= 0((logn) d - 2 ), 


and hence 


Xi 


Fig. 15 for an illustration of these sets for d = 3. 


Fig. 15. Depiction of the sets Bi,B 2 and B 3 from the proof of Lemma [I] in 
the case that d = 3. 


Lemma 1. Assume f : [0, \\ d —> R. is continuous at the origin 
and there exists a, M > 0 such that a < / < M. Then 

E{L n ) < _ 2 ' c n,d + o((logn) d ~ 1 ) as n ->• oo. 

Proof. Let e > 0 and choose 0 < S < ^ so that 

/(0) - e < /(x) < /(0) + e for any x £ [0, 25} d , (5) 

and 3/(0) < S~ d . As in the proof of Proposition [I] we have 
E(L n ) = nP(Y-[ £ C n ), so conditioning on Y\ we have 

E(L n ) = n [ f(x)P(Y 1 £ C n | Yi = x) dx. 

0,l] d 

As in the proof of Theorem [l] we have 

f(x)P(Y 1 £ C n | Yi = x) dx 


E(L n ) = n f f(x)P(Yi £ C n | Yi = x) dx 
J[o ,s] d 

+ 0((\ogn) d ~ 2 ). (6) 


Fix x £ [0, and define A = {y £ [0, l] d : Vi, yi < xf\ 

and 

Bi = \ y £ [0, l] d : Mj ± i, yj < Xj 


and Xi <y z < 2 xi --y, >, 

x j J 

for i = 1 ,...,d, and note that Bi C [0, 25} d for all i. See 


5 = ( 2xi - —z 2 , z 2 , x 3 ,...,Xd) ■ 

V Z2 J 

By the definitions of B\ and B 2 we see that yi < yi and 
Zi < Zi for all i, hence y,z £ Q n . Let a £ (0,1) such that 

ayi + (1 - a) ( 2xi - —z 2 ) = Xi. 

V x 2 ) 

A short calculation shows that x = cty+(l— a)z which implies 
that x is in the interior of the convex hull of Q n , proving the 
claim. 

Let E denote the event that at most one of //..... If/ 
contains a sample from Y 2 ,..., Y n , and let F denote the event 
that A contains no samples from Y 2 ,... ,Y n . Then by the 
observation above we have 

P(Y\ £ C n | Yi =x)< P(ECiF | Yi = x) = P(EnF). (7) 

For i = 1,... ,d, let Ei denote the event that 11, contains no 
samples from Y 2 ,..., Y n . It is not hard to see that 

£ =u (n^\n^)ufn^ 

i=1 j ) \ j 

Furthermore, the events in the unions above are mutually 
exclusive (disjoint) and PijEj C Hj^iEj for i = l,...,d. 
It follows that 

P{E IT F) 

d 

= ^2 (P {rijjtiEj nF)~ p (i DjEj n f)) + p (rijEj n f) 

i— 1 
d 

= J2 P r\F)-{d- l )P (fijEj n F) 

i— 1 

d / \ n_1 

= ( 1 _ F ( x ) _ [ f(y) d v) 

i = 1 \ ) 

1 - F(x) - f f(y) dy ] . (8) 

J 

A simple computation shows that \Bj\ = ' x- t ■ ■ ■ x,/ for j = 
1,..., d. Since A, Bi C [0, 26] d , we have by © that 

(/(0) - e)xi ■■■x d < F(x) < (/(0) + e)xi ■ ■ • x d , 

and 

^(/( 0 ) -e)xi • • -x d < j f(y) dy < ^(/( 0) + e)xi ■ ■ ■ x d . 
J Bj 

Inserting these into (|8]i and combining with (|7]i we have 

P(Yr € C n | Yi = x) 

2c? — 1 


<d 1- 


-(/( 0 ) -£)xi---x d 


n—1 


n —1 


We claim that if at least two of B \,..., B d contain samples 
from Y 2 ,...,Y„, and Yi = x, then Y± f CP. To see 
this, assume without loss of generality that B\ and B 2 are 
nonempty and let y £ B\ and z £ B 2 . Set 

y = (yi,2 x 2 - — yi,x 3 , ...,x d 

V Xl 


- {d - 1) (1 - 2(/(0) + e)xi ■■■x d ) 

We can now insert this into <[6| and apply Proposition [2] (since 
3/(0) < 5~ d ) to obtain 

d2 f(°)+ £ rf-l/( 0 )-e N _ 

P(Pn) — [ r, 7 1 /*/n\ r, /-/r,\ , ) £ n.d 


2 d-l/( 0 )-e 2 f{0) + E/ 

+ 0((log n) d ~ 2 ). 
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Since e > 0 was arbitrary, we find that 

, d 2 d — 1 

lim sup c n d E(L n ) < 


2d-l 


3d — 1 
Ad- 2' 


Lemma 2. Assume f : [0, l] d —>• K is continuous and there 
exists a,M >0 such that a < / < M. 7Vzen 

-E(^n) > ^ • c n ,d + o((logn) d_1 ) as n -)• oo. 

Proof. Let £ > 0 and choose 0 < c) < 1/d so that 

/(0) -e </(x) < /(0) + er for xe[0,d<5] d , (9) 


and 

<i d 

“,(/(o) + £)<r d . (io) 

As in the proof of Lemma [l] we have 

E(L n ) = n f f(x)P(Yi e C n \Yi=x) dx 

d [0,5] d 

+ 0 ((log n) d ~ 2 ). (11) 

Fix a; € (0, <5) d , set i/ = (i,..., ffj and 

A = {y £ [0, l] d | yv<x - v). 

Note that A is a simplex with an orthogonal corner at the origin 
and side lengths d-x i,..., d-x d . A simple computation shows 
that | A\ = • • • Xd■ By ([£]) we have 

f(y ) dy < (/( 0 ) + e)|A| = ^-(/( 0 ) + e)x l ■ ■ ■ x d . 

It is easy to see that if A is empty and Y\ = x then Y\ £ C n , 
hence 

P(Yi &C n \Y 1 =x)> (l - j A f(y) dy^j 

( d d \ n ~ 1 

> f 1 - ^j-(/( 0 ) +e)xi ■■■x d J 

Inserting this into {n} and noting ( flQ| , we can apply Propo¬ 
sition U to obtain 

E(Ln) > ^jf|^c n , d + 0 ((logn) d - 2 ), 

and hence ^ 

lim sup cf d E(L n ) > —. 0 

n—> oo Cl 

Theorem [2] is obtained by combining Lemmas [I] and [2] 
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